Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using static for LUTs #354

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Using static for LUTs #354

wants to merge 2 commits into from

Conversation

bryates
Copy link
Contributor

@bryates bryates commented Jan 14, 2025

This is an alternative to #351 to address an issue where CMSSW wastes CPU cycles for emptyUnit and nearFull3Unit. All luts are now declared as static, so this should fix the issue.

@bryates bryates requested review from tomalin and aehart January 14, 2025 17:19
@bryates
Copy link
Contributor Author

bryates commented Jan 14, 2025

This seems to improve timing as well (unless I'm behind on the latest numbers)

#=== Post-Implementation Resource usage ===
CLB:            580
LUT:           2250
FF:            2827
DSP:              2
BRAM:             2
SRL:             13
URAM:             0
#=== Final timing ===
CP required:    4.000
CP achieved post-synthesis:    3.227
CP achieved post-implementation:    3.202

@@ -7,7 +7,7 @@

template<int kNBitsBuffer>
static const ap_uint<(1 << (2 * kNBitsBuffer))> nearFullUnit() {
ap_uint<(1 << (2 * kNBitsBuffer))> lut;
static ap_uint<(1 << (2 * kNBitsBuffer))> lut;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bryates For this to work, I think you also need logic to prevent the reinitialization of the LUT. For example:

template<int kNBitsBuffer>
static const ap_uint<(1 << (2 * kNBitsBuffer))> nearFullUnit() {
  static ap_uint<(1 << (2 * kNBitsBuffer))> lut;
  static bool initialized = false;
  if (!initialized) {
    initialized = true;
    for(int i = 0; i < (1 << (2 * kNBitsBuffer)); ++i) {
#pragma HLS unroll
      ap_uint<kNBitsBuffer> wptr, rptr;
      ap_uint<2 * kNBitsBuffer> address(i);
      (rptr,wptr) = address;
      auto wptr1 = wptr+1;
      auto wptr2 = wptr+2;
      bool result = wptr1==rptr || wptr2==rptr;
      lut[i] = result;
    }
  }
  return lut;
}

Of course, this applies to the other functions in this file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bryates You might also remove the static const on the return types of these functions, since they don't really have any effect.

@aehart
Copy link
Contributor

aehart commented Feb 11, 2025

This seems to improve timing as well (unless I'm behind on the latest numbers)

#=== Post-Implementation Resource usage ===
CLB:            580
LUT:           2250
FF:            2827
DSP:              2
BRAM:             2
SRL:             13
URAM:             0
#=== Final timing ===
CP required:    4.000
CP achieved post-synthesis:    3.227
CP achieved post-implementation:    3.202

@bryates After addressing the comment above, could you rerun the out-of-context implementation and also do a comparison of the resource utilization, before your changes vs. after? Just to make sure nothing unexpected is happening during C-synthesis.

@bryates
Copy link
Contributor Author

bryates commented Feb 11, 2025

This seems to improve timing as well (unless I'm behind on the latest numbers)

#=== Post-Implementation Resource usage ===
CLB:            580
LUT:           2250
FF:            2827
DSP:              2
BRAM:             2
SRL:             13
URAM:             0
#=== Final timing ===
CP required:    4.000
CP achieved post-synthesis:    3.227
CP achieved post-implementation:    3.202

@bryates After addressing the comment above, could you rerun the out-of-context implementation and also do a comparison of the resource utilization, before your changes vs. after? Just to make sure nothing unexpected is happening during C-synthesis.

These changes reduced the resource utilization a bit. The post-implementation timing is about the same.

#=== Post-Implementation Resource usage ===
CLB:            541
LUT:           1794
FF:            2540
DSP:              2
BRAM:             2
SRL:              0
URAM:             0
#=== Final timing ===
CP required:    4.000
CP achieved post-synthesis:    2.900
CP achieved post-implementation:    3.200
Timing met

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants